Introduction

A quick, statistics-based univariate audit of the dataset's columns.

This US Census dataset contains detailed but anonymised information for approximately 300,000 people.

Metadata

This data was extracted from the Census Bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html

Donor: Terran Lane and Ronny Kohavi, Data Mining and Visualization. E-mail terran@ecn.purdue.edu or ronnyk@sgi.com for questions.

The data was split into train/test in approximately 2/3, 1/3 proportions using MineSet’s MIndUtil mineset-to-mlc.

The prediction task is to determine the income level for the person represented by the record. Incomes have been binned at the $50K level to present a binary classification problem, much like the original UCI/ADULT database. The goal field of this data, however, was drawn from the “total person income” field rather than the “adjusted gross income” and may, therefore, behave differently than the original ADULT goal field.

More information detailing the meaning of the attributes can be found in http://www.bls.census.gov/cps/cpsmain.htm To make use of the data descriptions at this site, the following mappings to the Census Bureau’s internal database column names will be needed:

Data descriptions Database column names
age AAGE
class of worker ACLSWKR
industry code ADTIND
occupation code ADTOCC
adjusted gross income AGI
education AHGA
wage per hour AHRSPAY
enrolled in edu inst last wk AHSCOL
marital status AMARITL
major industry code AMJIND
major occupation code AMJOCC
race ARACE
hispanic origin AREORGN
sex ASEX
member of a labor union AUNMEM
reason for unemployment AUNTYPE
full or part time employment stat AWKSTAT
capital gains CAPGAIN
capital losses CAPLOSS
dividends from stocks DIVVAL
federal income tax liability FEDTAX
tax filer status FILESTAT
region of previous residence GRINREG
state of previous residence GRINST
detailed household and family stat HHDFMX
detailed household summary in household HHDREL
instance weight MARSUPWT
migration code-change in msa MIGMTR1
migration code-change in reg MIGMTR3
migration code-move within reg MIGMTR4
live in this house 1 year ago MIGSAME
migration prev res in sunbelt MIGSUN
num persons worked for employer NOEMP
family members under 18 PARENT
total person earnings PEARNVAL
country of birth father PEFNTVTY
country of birth mother PEMNTVTY
country of birth self PENATVTY
citizenship PRCITSHP
total person income PTOTVAL
own business or self employed SEOTR
taxable income amount TAXINC
fill inc questionnaire for veteran’s admin VETQVA
veterans benefits VETYN
weeks worked in year WKSWORK

Basic statistics for this data set:

Number of instances data = 199523

Duplicate or conflicting instances : 46716

Number of instances in test = 99762

Duplicate or conflicting instances : 20936

Class probabilities for income-projected.test file

Probability for the label ‘- 50000’ : 93.80%

Probability for the label ‘50000+’ : 6.20%

Majority accuracy: 93.80% on value - 50000
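
The class probabilities and the majority accuracy above can be recomputed from any vector of labels. The following is a minimal sketch, assuming a toy vector `toy.labels` (hypothetical values, not the real file) built to mimic the reported proportions:

```r
# toy labels standing in for the label column (hypothetical values)
toy.labels <- c(rep(" - 50000.", 938), rep(" 50000+.", 62))

# class probabilities: proportion of each label
class.prob <- prop.table(table(toy.labels))

# majority accuracy: share of the most frequent label,
# i.e. the accuracy of always predicting that label
majority.label <- names(which.max(class.prob))
majority.accuracy <- max(class.prob)

print(round(100 * class.prob, 2))
print(majority.label)
```

Any model has to beat this majority accuracy to be useful, which is why such an unbalanced label deserves attention before modelling.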

Number of attributes = 40 (continuous : 7 nominal : 33)

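# import the learning data set into a data frame
# (stringsAsFactors = TRUE keeps text columns as factors, which the summary()
# calls below rely on; it was the default before R 4.0)
df <- read.csv('data/census_income_learn.csv', header = FALSE, stringsAsFactors = TRUE)
label <- df[[42]] == " 50000+."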

We will look at the columns one by one.

Analysis

Age

Metadata

91 distinct values for attribute #0 (age) continuous

Statistics

ages <- df[[1]]
summary(ages)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   33.00   34.49   50.00   90.00
plot(density(ages))

boxplot(ages, ages[label], ages[!label], horizontal = TRUE, names=c("global", "gt50000", "lt50000"))

The age distribution of people with an income greater than 50K differs from that of the others.

Class of worker

Metadata

9 distinct values for attribute #1 (class of worker) nominal

class of worker: Not in universe, Federal government, Local government, Never worked, Private, Self-employed-incorporated, Self-employed-not incorporated, State government, Without pay.

Statistics

class.of.worker <- df[[2]]
par(mar=c(5.1, 13 ,4.1 ,2.1))
barplot(sort(prop.table(summary(class.of.worker[label]))),las=1, horiz = TRUE)

barplot(sort(prop.table(summary(class.of.worker[!label]))),las=1, horiz = TRUE)

The categories that stand out are Private, Self-employed-incorporated, Not in universe and government (the three government values merged).
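
Merging the three government values into one level, as suggested above, can be sketched as follows. The vector `toy.class` and the merged level name `" Government"` are hypothetical; the leading space mirrors the way values appear in the data file.

```r
# toy factor standing in for class.of.worker (hypothetical sample)
toy.class <- factor(c(" Private", " Federal government", " Local government",
                      " State government", " Not in universe", " Private"))

# give the three government levels the same name: R merges them into one level
merged.class <- toy.class
gov <- levels(merged.class) %in% c(" Federal government", " Local government",
                                   " State government")
levels(merged.class)[gov] <- " Government"

print(table(merged.class))
```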

Detailed industry recode

Metadata

52 distinct values for attribute #2 (detailed industry recode) nominal

detailed industry recode: 0, 40, 44, 2, 43, 47, 48, 1, 11, 19, 24, 25, 32, 33, 34, 35, 36, 37, 38, 39, 4, 42, 45, 5, 15, 16, 22, 29, 31, 50, 14, 17, 18, 28, 3, 30, 41, 46, 51, 12, 13, 21, 23, 26, 6, 7, 9, 49, 27, 8, 10, 20.

Statistics

industry.code <- as.factor(df[[3]])
par(mar=c(5.1, 5 ,4.1 ,2.1))
barplot(prop.table(summary(industry.code[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(industry.code[!label])),las=1, horiz = TRUE)

Code 0 occurs mostly in the - 50000 class, so we can distinguish 0 from non-0.
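
The 0 versus non-0 split can be turned into a binary indicator and compared with the label. This is a minimal sketch on toy vectors (`toy.industry` and `toy.label` are hypothetical, not taken from the file):

```r
# toy industry codes and income labels (hypothetical values)
toy.industry <- c(0, 40, 0, 33, 0, 4)
toy.label <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)

# TRUE when the industry code is 0
industry.is.zero <- toy.industry == 0

# share of each income class within the 0 and non-0 groups (rows sum to 1)
tab <- prop.table(table(industry.is.zero, toy.label), margin = 1)
print(tab)
```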

Detailed occupation recode (ignored in model)

Metadata

47 distinct values for attribute #3 (detailed occupation recode) nominal

detailed occupation recode: 0, 12, 31, 44, 19, 32, 10, 23, 26, 28, 29, 42, 40, 34, 14, 36, 38, 2, 20, 25, 37, 41, 27, 24, 30, 43, 33, 16, 45, 17, 35, 22, 18, 39, 3, 15, 13, 46, 8, 21, 9, 4, 6, 5, 1, 11, 7.

Statistics

occupation.code <- as.factor(df[[4]])
par(mar=c(5.1, 5 ,4.1 ,2.1))
barplot(prop.table(summary(occupation.code[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(occupation.code[!label])),las=1, horiz = TRUE)

Same observation: the big difference is the proportion of 0.

Education

Metadata

17 distinct values for attribute #4 (education) nominal

education: Children, 7th and 8th grade, 9th grade, 10th grade, High school graduate, 11th grade, 12th grade no diploma, 5th or 6th grade, Less than 1st grade, Bachelors degree(BA AB BS), 1st 2nd 3rd or 4th grade, Some college but no degree, Masters degree(MA MS MEng MEd MSW MBA), Associates degree-occup /vocational, Associates degree-academic program, Doctorate degree(PhD EdD), Prof school degree (MD DDS DVM LLB JD).

Statistics

education <- as.factor(df[[5]])
par(mar=c(5.1, 19,4.1 ,2.1))
barplot(prop.table(summary(education[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(education[!label])),las=1, horiz = TRUE)

There are two categories:

  • Up to high school: Children, 7th and 8th grade, 9th grade, 10th grade, High school graduate, 11th grade, 12th grade no diploma, 5th or 6th grade, Less than 1st grade, 1st 2nd 3rd or 4th grade

  • Higher-education degree: Bachelors degree(BA AB BS), Masters degree(MA MS MEng MEd MSW MBA), Associates degree-occup /vocational, Associates degree-academic program, Doctorate degree(PhD EdD), Prof school degree (MD DDS DVM LLB JD).

Some college but no degree appears in about the same proportion in both income classes.
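
The grouping above can be expressed as a recoding of the education values. The sketch below works on a toy vector (`toy.education` is hypothetical), and the group names `"degree"`, `"no degree"` and `"some college"` are illustrative choices, not names used elsewhere in this analysis.

```r
# values counted as a degree (leading space as in the data file)
higher.degrees <- c(" Bachelors degree(BA AB BS)",
                    " Masters degree(MA MS MEng MEd MSW MBA)",
                    " Associates degree-occup /vocational",
                    " Associates degree-academic program",
                    " Doctorate degree(PhD EdD)",
                    " Prof school degree (MD DDS DVM LLB JD)")

# toy sample of education values (hypothetical)
toy.education <- c(" Children", " Bachelors degree(BA AB BS)", " 9th grade",
                   " Doctorate degree(PhD EdD)", " Some college but no degree")

# degree holders, the undecided "some college" value, and everything else
education.group <- factor(
  ifelse(toy.education %in% higher.degrees, "degree",
         ifelse(toy.education == " Some college but no degree",
                "some college", "no degree")))

print(table(education.group))
```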

Wage per hour

Metadata

1240 distinct values for attribute #5 (wage per hour) continuous

wage per hour: continuous.

Statistics

wage.per.hour <- df[[6]]
summary(wage.per.hour[label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   81.64    0.00 9999.00
summary(wage.per.hour[!label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   53.69    0.00 9916.00
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(wage.per.hour))

boxplot(wage.per.hour[label], wage.per.hour[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))

This variable should be combined with another one.

Enroll in edu inst last wk (ignored in model)

Metadata

3 distinct values for attribute #6 (enroll in edu inst last wk) nominal

enroll in edu inst last wk: Not in universe, High school, College or university.

Statistics

enroll.in.edu.inst.last.wk <- df[[7]]
par(mar=c(5.1, 10 ,4.1 ,2.1))
barplot(prop.table(summary(enroll.in.edu.inst.last.wk[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(enroll.in.edu.inst.last.wk[!label])),las=1, horiz = TRUE)

This variable does not look very important; we may ignore it at first.

Marital stat

Metadata

7 distinct values for attribute #7 (marital stat) nominal

marital stat: Never married, Married-civilian spouse present, Married-spouse absent, Separated, Divorced, Widowed, Married-A F spouse present.

Statistics

marital.stat <- df[[8]]
par(mar=c(5.1, 13 ,4.1 ,2.1))
barplot(prop.table(summary(marital.stat[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(marital.stat[!label])),las=1, horiz = TRUE)

Three groups: Never married, Married-civilian spouse present, and all others.

Major industry code (ignored in model)

Metadata

24 distinct values for attribute #8 (major industry code) nominal

major industry code: Not in universe or children, Entertainment, Social services, Agriculture, Education, Public administration, Manufacturing-durable goods, Manufacturing-nondurable goods, Wholesale trade, Retail trade, Finance insurance and real estate, Private household services, Business and repair services, Personal services except private HH, Construction, Medical except hospital, Other professional services, Transportation, Utilities and sanitary services, Mining, Communications, Hospital services, Forestry and fisheries, Armed Forces.

Statistics

major.industry.code <- df[[9]]
par(mar=c(5.1, 15 ,4.1 ,2.1))
barplot(prop.table(summary(major.industry.code[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(major.industry.code[!label])),las=1, horiz = TRUE)

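# compare major industry code not in universe and industry code 0
# (`&&` only ever used the first element of each vector and is an error for
# longer vectors since R 4.3, so the first record is indexed explicitly)
major.industry.code[1] == ' Not in universe or children' && industry.code[1] == 0
## [1] TRUE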

This field appears to duplicate the industry code, although the check above only looks at the first record.
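
Because `&&` examines a single element, a check over every record needs an element-wise comparison. The following is a minimal sketch on toy vectors; `toy.major` and `toy.code` are hypothetical, and the code-to-label pairs are made up for illustration.

```r
# toy major industry labels and detailed codes (hypothetical pairs)
toy.major <- c(" Not in universe or children", " Retail trade",
               " Not in universe or children", " Education")
toy.code <- c(0, 33, 0, 43)

# element-wise: does "not in universe" coincide with code 0 on every record?
same.flag <- (toy.major == " Not in universe or children") == (toy.code == 0)
all.agree <- all(same.flag)

# cross-tabulation: if each detailed code maps to exactly one major label,
# every row has a single non-zero cell and the major code adds no information
cross <- table(toy.code, toy.major)
one.major.per.code <- all(rowSums(cross > 0) == 1)

print(all.agree)
print(one.major.per.code)
```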

Major occupation code

Metadata

15 distinct values for attribute #9 (major occupation code) nominal

major occupation code: Not in universe, Professional specialty, Other service, Farming forestry and fishing, Sales, Adm support including clerical, Protective services, Handlers equip cleaners etc , Precision production craft & repair, Technicians and related support, Machine operators assmblrs & inspctrs, Transportation and material moving, Executive admin and managerial, Private household services, Armed Forces.

Statistics

major.occupation.code <- df[[10]]
par(mar=c(5.1, 16 ,4.1 ,2.1))
barplot(prop.table(summary(major.occupation.code[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(major.occupation.code[!label])),las=1, horiz = TRUE)

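# compare major occupation code not in universe and occupation code 0
# (`&&` only ever used the first element of each vector and is an error for
# longer vectors since R 4.3, so the first record is indexed explicitly)
major.occupation.code[1] == ' Not in universe' && occupation.code[1] == 0
## [1] TRUE
# compare major occupation code not in universe and weeks worked in year
major.occupation.code[1] == " Not in universe" && df[[40]][1] == 0
## [1] TRUE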

We can create a dummy variable for Sales, Professional specialty, Executive admin and managerial.
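
One reading of this suggestion is a single 0/1 variable that flags the three occupations. A minimal sketch under that assumption, with a hypothetical toy vector `toy.occupation`:

```r
# occupations to flag (leading space as in the data file)
flagged.jobs <- c(" Sales", " Professional specialty",
                  " Executive admin and managerial")

# toy sample of major occupation codes (hypothetical)
toy.occupation <- c(" Sales", " Other service", " Not in universe",
                    " Executive admin and managerial")

# 1 when the occupation is one of the three flagged values, 0 otherwise
occupation.dummy <- as.integer(toy.occupation %in% flagged.jobs)
print(occupation.dummy)
```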

Race

Metadata

5 distinct values for attribute #10 (race) nominal

race: White, Black, Other, Amer Indian Aleut or Eskimo, Asian or Pacific Islander.

Statistics

race <- df[[11]]
par(mar=c(5.1,12,4.1,2.1))
barplot(prop.table(summary(race[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(race[!label])),las=1, horiz = TRUE)

We can divide the values into 2 categories:

  • White, Asian or Pacific Islander.

  • Black, Other, Amer Indian Aleut or Eskimo.

Hispanic origin

Metadata

10 distinct values for attribute #11 (hispanic origin) nominal

hispanic origin: Mexican (Mexicano), Mexican-American, Puerto Rican, Central or South American, All other, Other Spanish, Chicano, Cuban, Do not know, NA.

Statistics

hispanic.origin <- df[[12]]
par(mar=c(5.1,12,4.1,2.1))
barplot(prop.table(summary(hispanic.origin[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(hispanic.origin[!label])),las=1, horiz = TRUE)

This variable can be merged with race, in the same group as Black, Other and Amer Indian Aleut or Eskimo.

Sex

Metadata

2 distinct values for attribute #12 (sex) nominal

sex: Female, Male.

Statistics

sex <- df[[13]]
par(mar=c(5.1,5,4.1,2.1))
barplot(prop.table(summary(sex[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(sex[!label])),las=1, horiz = TRUE)

We can keep this two-valued variable as is.

Member of a labor union (ignored in model)

Metadata

3 distinct values for attribute #13 (member of a labor union) nominal

member of a labor union: Not in universe, No, Yes.

Statistics

member.labor.union <- df[[14]]
par(mar=c(5.1,8,4.1,2.1))
barplot(prop.table(summary(member.labor.union[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(member.labor.union[!label])),las=1, horiz = TRUE)

This variable does not look very important; we may ignore it at first.

Reason for unemployment (ignored in model)

Metadata

6 distinct values for attribute #14 (reason for unemployment) nominal

reason for unemployment: Not in universe, Re-entrant, Job loser - on layoff, New entrant, Job leaver, Other job loser.

Statistics

reason.unemployement <- df[[15]]
par(mar=c(5.1,10,4.1,2.1))
barplot(prop.table(summary(reason.unemployement[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(reason.unemployement[!label])),las=1, horiz = TRUE)

This variable does not look very important; we may ignore it at first.

Full or part time employment stat

Metadata

8 distinct values for attribute #15 (full or part time employment stat) nominal

full or part time employment stat: Children or Armed Forces, Full-time schedules, Unemployed part- time, Not in labor force, Unemployed full-time, PT for non-econ reasons usually FT, PT for econ reasons usually PT, PT for econ reasons usually FT.

Statistics

full.part.employment.stat <- df[[16]]
par(mar=c(5.1,15,4.1,2.1))
barplot(prop.table(summary(full.part.employment.stat[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(full.part.employment.stat[!label])),las=1, horiz = TRUE)

We can divide the values into 2 categories:

  • Children or Armed Forces, Not in labor force
  • Full-time schedules

Capital gains

Metadata

132 distinct values for attribute #16 (capital gains) continuous

capital gains: continuous.

Statistics

capital.gains <- df[[17]]
summary(capital.gains[label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    4831       0  100000
summary(capital.gains[!label])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    143.8      0.0 100000.0
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(capital.gains[label]))

plot(density(capital.gains[!label]))

boxplot(capital.gains[label], capital.gains[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))

We may combine this variable with others.

Capital losses

Metadata

113 distinct values for attribute #17 (capital losses) continuous

capital losses: continuous.

Statistics

capital.losses <- df[[18]]
summary(capital.losses[label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   193.1     0.0  3683.0
summary(capital.losses[!label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0      27       0    4608
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(capital.losses[label]))

plot(density(capital.losses[!label]))

boxplot(capital.losses[label], capital.losses[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))

We may combine this variable with others.

Dividends from stocks

Metadata

1478 distinct values for attribute #18 (dividends from stocks) continuous

dividends from stocks: continuous.

Statistics

dividends.stocks <- df[[19]]
summary(dividends.stocks[label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1553     363  100000
summary(dividends.stocks[!label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   107.8     0.0 39000.0
par(mar=c(5.1, 5,4.1 ,2.1))
plot(density(dividends.stocks[label]))

plot(density(dividends.stocks[!label]))

boxplot(dividends.stocks[label], dividends.stocks[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))

We may combine this variable with others.

Tax filer stat

Metadata

6 distinct values for attribute #19 (tax filer stat) nominal

tax filer stat: Nonfiler, Joint one under 65 & one 65+, Joint both under 65, Single, Head of household, Joint both 65+.

Statistics

tax.filer.stat <- df[[20]]
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary(tax.filer.stat[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(tax.filer.stat[!label])),las=1, horiz = TRUE)

2 categories:

  • Nonfiler

  • Joint both under 65

Region of previous residence (ignored in model)

Metadata

6 distinct values for attribute #20 (region of previous residence) nominal

region of previous residence: Not in universe, South, Northeast, West, Midwest, Abroad.

Statistics

region.previous.residence <- df[[21]]
par(mar=c(5.1,8,4.1,2.1))
barplot(prop.table(summary(region.previous.residence[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(region.previous.residence[!label])),las=1, horiz = TRUE)

This variable does not look very important; we may ignore it at first.

State of previous residence (ignored in model)

Metadata

51 distinct values for attribute #21 (state of previous residence) nominal

state of previous residence: Not in universe, Utah, Michigan, North Carolina, North Dakota, Virginia, Vermont, Wyoming, West Virginia, Pennsylvania, Abroad, Oregon, California, Iowa, Florida, Arkansas, Texas, South Carolina, Arizona, Indiana, Tennessee, Maine, Alaska, Ohio, Montana, Nebraska, Mississippi, District of Columbia, Minnesota, Illinois, Kentucky, Delaware, Colorado, Maryland, Wisconsin, New Hampshire, Nevada, New York, Georgia, Oklahoma, New Mexico, South Dakota, Missouri, Kansas, Connecticut, Louisiana, Alabama, Massachusetts, Idaho, New Jersey.

Statistics

state.previous.residence <- gsub("?",NA,df[[22]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(sort(prop.table(summary.factor(state.previous.residence[label])))[!names(sort(prop.table(summary.factor(state.previous.residence[label])))) %in% c("NA's", " Not in universe")],las=1, horiz = TRUE)

barplot(sort(prop.table(summary.factor(state.previous.residence[!label])))[!names(sort(prop.table(summary.factor(state.previous.residence[!label])))) %in% c("NA's", " Not in universe")],las=1, horiz = TRUE)

#ratio of Not in the universe and not applicable
prop.table(summary.factor(state.previous.residence[label]))[names(summary.factor(state.previous.residence[label])) %in% c("NA's", " Not in universe")]
##  Not in universe             NA's 
##      0.950088839      0.003634308
prop.table(summary.factor(state.previous.residence[!label]))[names(summary.factor(state.previous.residence[!label])) %in% c("NA's", " Not in universe")]
##  Not in universe             NA's 
##      0.919018280      0.003542783

This variable does not look very important; we may ignore it at first.
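
The `gsub("?", NA, ...)` step converts the `?` placeholder to `NA` one column at a time. An alternative is to declare the placeholder when the file is read. The sketch below assumes a small inline sample (`toy.csv` is hypothetical) rather than the real file:

```r
# two toy records using the same ", " separator and "?" placeholder
toy.csv <- "34, Not in universe, ?\n51, Utah, Yes\n"

# strip.white removes the leading space of each field; the na.strings test
# happens after that stripping, so "?" matches the " ?" fields
toy.df <- read.csv(text = toy.csv, header = FALSE,
                   na.strings = "?", strip.white = TRUE,
                   stringsAsFactors = TRUE)

# ratio of missing values in the third column
na.ratio <- mean(is.na(toy.df[[3]]))
print(na.ratio)
print(levels(toy.df[[2]]))
```

Note that `strip.white = TRUE` also removes the leading space from every value, so comparisons such as `== " Not in universe"` would have to be written without that space.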

Detailed household and family stat

Metadata

38 distinct values for attribute #22 (detailed household and family stat) nominal

detailed household and family stat: Child <18 never marr not in subfamily, Other Rel <18 never marr child of subfamily RP, Other Rel <18 never marr not in subfamily, Grandchild <18 never marr child of subfamily RP, Grandchild <18 never marr not in subfamily, Secondary individual, In group quarters, Child under 18 of RP of unrel subfamily, RP of unrelated subfamily, Spouse of householder, Householder, Other Rel <18 never married RP of subfamily, Grandchild <18 never marr RP of subfamily, Child <18 never marr RP of subfamily, Child <18 ever marr not in subfamily, Other Rel <18 ever marr RP of subfamily, Child <18 ever marr RP of subfamily, Nonfamily householder, Child <18 spouse of subfamily RP, Other Rel <18 spouse of subfamily RP, Other Rel <18 ever marr not in subfamily, Grandchild <18 ever marr not in subfamily, Child 18+ never marr Not in a subfamily, Grandchild 18+ never marr not in subfamily, Child 18+ ever marr RP of subfamily, Other Rel 18+ never marr not in subfamily, Child 18+ never marr RP of subfamily, Other Rel 18+ ever marr RP of subfamily, Other Rel 18+ never marr RP of subfamily, Other Rel 18+ spouse of subfamily RP, Other Rel 18+ ever marr not in subfamily, Child 18+ ever marr Not in a subfamily, Grandchild 18+ ever marr not in subfamily, Child 18+ spouse of subfamily RP, Spouse of RP of unrelated subfamily, Grandchild 18+ ever marr RP of subfamily, Grandchild 18+ never marr RP of subfamily, Grandchild 18+ spouse of subfamily RP.

Statistics

household.family.stat <- df[[23]]
par(mar=c(5.1,19,4.1,2.1))
barplot(prop.table(summary(household.family.stat[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(household.family.stat[!label])),las=1, horiz = TRUE)

2 categories:

  • householders

  • non householders

Detailed household summary in household (ignored in model)

Metadata

8 distinct values for attribute #23 (detailed household summary in household) nominal

detailed household summary in household: Child under 18 never married, Other relative of householder, Nonrelative of householder, Spouse of householder, Householder, Child under 18 ever married, Group Quarters- Secondary individual, Child 18 or older.

Statistics

household.summary.household <- df[[24]]
par(mar=c(5.1,16,4.1,2.1))
barplot(prop.table(summary(household.summary.household[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(household.summary.household[!label])),las=1, horiz = TRUE)

This variable does not look very important; we may ignore it at first because it is redundant with the previous one.

Instance weight (ignored in model)

Metadata

instance weight: continuous (flagged as “ignore” in the metadata).

Statistics

instance.weight <- df[[25]]
summary(instance.weight[label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.82 1123.00 1684.00 1796.00 2241.00 8433.00
summary(instance.weight[!label])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    37.87  1057.00  1613.00  1737.00  2186.00 18660.00
plot(density(instance.weight[label]))

plot(density(instance.weight[!label]))

boxplot(instance.weight[label], instance.weight[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))

This variable does not look very important; we may ignore it at first.

Migration code-change in msa (ignored in model)

Metadata

10 distinct values for attribute #24 (migration code-change in msa) nominal

migration code-change in msa: Not in universe, Nonmover, MSA to MSA, NonMSA to nonMSA, MSA to nonMSA, NonMSA to MSA, Abroad to MSA, Not identifiable, Abroad to nonMSA.

Statistics

migration.code.msa <- gsub("?",NA,df[[26]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.code.msa[label])),las=1, horiz = TRUE)

barplot(prop.table(summary.factor(migration.code.msa[!label])),las=1, horiz = TRUE)

prop.table(summary.factor(migration.code.msa[label]))[names(prop.table(summary.factor(migration.code.msa[label]))) %in% c("NA's", " Nonmover")]
##  Nonmover      NA's 
## 0.4216605 0.5284284
prop.table(summary.factor(migration.code.msa[!label]))[names(prop.table(summary.factor(migration.code.msa[!label]))) %in% c("NA's", " Nonmover")]
##  Nonmover      NA's 
## 0.4131484 0.4977691

This variable does not look very important; we may ignore it at first.

Migration code-change in reg (ignored in model)

Metadata

9 distinct values for attribute #25 (migration code-change in reg) nominal

migration code-change in reg: Not in universe, Nonmover, Same county, Different county same state, Different state same division, Abroad, Different region, Different division same region.

Statistics

migration.code.change.reg <- gsub("?",NA,df[[27]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.code.change.reg[label])),las=1, horiz = TRUE)

barplot(prop.table(summary.factor(migration.code.change.reg[!label])),las=1, horiz = TRUE)

#ratio of Nonmover and not applicable
prop.table(summary.factor(migration.code.change.reg[label]))[names(prop.table(summary.factor(migration.code.change.reg[label]))) %in% c("NA's", " Nonmover")]
##  Nonmover      NA's 
## 0.4216605 0.5284284
prop.table(summary.factor(migration.code.change.reg[!label]))[names(prop.table(summary.factor(migration.code.change.reg[!label]))) %in% c("NA's", " Nonmover")]
##  Nonmover      NA's 
## 0.4131484 0.4977691

This variable does not look very important; we may ignore it at first.

Migration code-move within reg (ignored in model)

Metadata

10 distinct values for attribute #26 (migration code-move within reg) nominal

migration code-move within reg: Not in universe, Nonmover, Same county, Different county same state, Different state in West, Abroad, Different state in Midwest, Different state in South, Different state in Northeast.

Statistics

migration.code.move.reg <- gsub("?",NA,df[[28]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.code.move.reg[label])),las=1, horiz = TRUE)

barplot(prop.table(summary.factor(migration.code.move.reg[!label])),las=1, horiz = TRUE)

#ratio of Nonmover and not applicable
prop.table(summary.factor(migration.code.move.reg[label]))[names(prop.table(summary.factor(migration.code.move.reg[label]))) %in% c("NA's", " Nonmover")]
##  Nonmover      NA's 
## 0.4216605 0.5284284
prop.table(summary.factor(migration.code.move.reg[!label]))[names(prop.table(summary.factor(migration.code.move.reg[!label]))) %in% c("NA's", " Nonmover")]
##  Nonmover      NA's 
## 0.4131484 0.4977691

This variable does not look very important; we may ignore it at first.
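
The three migration code variables report identical Nonmover and NA ratios, which suggests that the same records are missing or Nonmover in all three. One way to check this on every record is sketched below with hypothetical toy vectors (`toy.msa`, `toy.reg`, `toy.move`):

```r
# toy versions of the three migration columns after the "?" to NA step
toy.msa  <- c(NA, " Nonmover", NA, " MSA to MSA")
toy.reg  <- c(NA, " Nonmover", NA, " Same county")
toy.move <- c(NA, " Nonmover", NA, " Same county")

# are the missing values located on exactly the same records?
same.na <- identical(is.na(toy.msa), is.na(toy.reg)) &&
  identical(is.na(toy.reg), is.na(toy.move))

# are the Nonmover records the same ones?
same.nonmover <- identical(which(toy.msa == " Nonmover"),
                           which(toy.reg == " Nonmover"))

print(same.na)
print(same.nonmover)
```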

Live in this house 1 year ago (ignored in model)

Metadata

3 distinct values for attribute #27 (live in this house 1 year ago) nominal

live in this house 1 year ago: Not in universe under 1 year old, Yes, No.

Statistics

live.in.this.house.1y.ago <- df[[29]]
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary(live.in.this.house.1y.ago[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(live.in.this.house.1y.ago[!label])),las=1, horiz = TRUE)

This variable does not look very important; we may ignore it at first.

Migration prev res in sunbelt (ignored in model)

Metadata

4 distinct values for attribute #28 (migration prev res in sunbelt) nominal

migration prev res in sunbelt: Not in universe, Yes, No.

Statistics

migration.in.sunbelt <- gsub("?",NA,df[[30]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(migration.in.sunbelt[label]))[!names(prop.table(summary.factor(migration.in.sunbelt[label]))) %in% c("NA's")],las=1, horiz = TRUE)

barplot(prop.table(summary.factor(migration.in.sunbelt[!label]))[!names(prop.table(summary.factor(migration.in.sunbelt[!label]))) %in% c("NA's")],las=1, horiz = TRUE)

#ratio of not applicable
prop.table(summary.factor(migration.in.sunbelt[label]))[names(prop.table(summary.factor(migration.in.sunbelt[label]))) %in% c("NA's")]
##      NA's 
## 0.5284284
prop.table(summary.factor(migration.in.sunbelt[!label]))[names(prop.table(summary.factor(migration.in.sunbelt[!label]))) %in% c("NA's")]
##      NA's 
## 0.4977691

This variable does not look very important; we may ignore it at first.

Num persons worked for employer

Metadata

7 distinct values for attribute #29 (num persons worked for employer) continuous

num persons worked for employer: continuous.

Statistics

num.persons.worked.employer <- df[[31]]
summary(num.persons.worked.employer[label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   4.004   6.000   6.000
summary(num.persons.worked.employer[!label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.821   4.000   6.000
plot(density(num.persons.worked.employer[label]))

plot(density(num.persons.worked.employer[!label]))

boxplot(num.persons.worked.employer[label], num.persons.worked.employer[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))

The two distributions clearly differ: people with an income greater than 50K have a median of 4, against 0 for the others.
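
The per-class summaries used for the continuous variables can be computed in a single call with `tapply`. A minimal sketch on toy vectors (`toy.employer` and `toy.label` are hypothetical values):

```r
# toy values for the variable and the income label (hypothetical)
toy.employer <- c(0, 6, 4, 0, 2, 6, 0, 1)
toy.label <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# median of the variable within each income class
medians <- tapply(toy.employer, toy.label, median)
print(medians)
```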

Family members under 18 (ignored in model)

Metadata

5 distinct values for attribute #30 (family members under 18) nominal

family members under 18: Both parents present, Neither parent present, Mother only present, Father only present, Not in universe.

Statistics

family.members.under.18 <- as.factor(df[[32]])
par(mar=c(5.1,10,4.1,2.1))
barplot(prop.table(summary(family.members.under.18[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(family.members.under.18[!label])),las=1, horiz = TRUE)

This variable does not look very important; we may ignore it at first. It is redundant with age and some other variables, because people with an income greater than 50K are mostly above 18.

Country of birth father (ignored in model)

Metadata

43 distinct values for attribute #31 (country of birth father) nominal

country of birth father: Mexico, United-States, Puerto-Rico, Dominican-Republic, Jamaica, Cuba, Portugal, Nicaragua, Peru, Ecuador, Guatemala, Philippines, Canada, Columbia, El-Salvador, Japan, England, Trinadad&Tobago, Honduras, Germany, Taiwan, Outlying-U S (Guam USVI etc), India, Vietnam, China, Hong Kong, Cambodia, France, Laos, Haiti, South Korea, Iran, Greece, Italy, Poland, Thailand, Yugoslavia, Holand-Netherlands, Ireland, Scotland, Hungary, Panama.

Statistics

country.birth.father <- gsub("?",NA,df[[33]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(country.birth.father[label]))[!names(prop.table(summary.factor(country.birth.father[label]))) %in% c("NA's")],las=1, horiz = TRUE)

barplot(prop.table(summary.factor(country.birth.father[!label]))[!names(prop.table(summary.factor(country.birth.father[!label]))) %in% c("NA's")],las=1, horiz = TRUE)

#ratio of not applicable
prop.table(summary.factor(country.birth.father[label]))[names(prop.table(summary.factor(country.birth.father[label]))) %in% c("NA's")]
##       NA's 
## 0.04433856
prop.table(summary.factor(country.birth.father[!label]))[names(prop.table(summary.factor(country.birth.father[!label]))) %in% c("NA's")]
##       NA's 
## 0.03293773

This variable does not look very important; we may ignore it at first.

Country of birth mother (ignored in model)

Metadata

43 distinct values for attribute #32 (country of birth mother) nominal

country of birth mother: India, Mexico, United-States, Puerto-Rico, Dominican-Republic, England, Honduras, Peru, Guatemala, Columbia, El-Salvador, Philippines, France, Ecuador, Nicaragua, Cuba, Outlying-U S (Guam USVI etc), Jamaica, South Korea, China, Germany, Yugoslavia, Canada, Vietnam, Japan, Cambodia, Ireland, Laos, Haiti, Portugal, Taiwan, Holand-Netherlands, Greece, Italy, Poland, Thailand, Trinadad&Tobago, Hungary, Panama, Hong Kong, Scotland, Iran.

Statistics

country.birth.mother <- gsub("?",NA,df[[34]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(country.birth.mother[label]))[!names(prop.table(summary.factor(country.birth.mother[label]))) %in% c("NA's")],las=1, horiz = TRUE)

barplot(prop.table(summary.factor(country.birth.mother[!label]))[!names(prop.table(summary.factor(country.birth.mother[!label]))) %in% c("NA's")],las=1, horiz = TRUE)

#ratio of not applicable
prop.table(summary.factor(country.birth.mother[label]))[names(prop.table(summary.factor(country.birth.mother[label]))) %in% c("NA's")]
##       NA's 
## 0.03787756
prop.table(summary.factor(country.birth.mother[!label]))[names(prop.table(summary.factor(country.birth.mother[!label]))) %in% c("NA's")]
##       NA's 
## 0.03019114

This variable does not look very important; we may ignore it at first.

Country of birth self (ignored in model)

Metadata

43 distinct values for attribute #33 (country of birth self) nominal

country of birth self: United-States, Mexico, Puerto-Rico, Peru, Canada, South Korea, India, Japan, Haiti, El-Salvador, Dominican-Republic, Portugal, Columbia, England, Thailand, Cuba, Laos, Panama, China, Germany, Vietnam, Italy, Honduras, Outlying-U S (Guam USVI etc), Hungary, Philippines, Poland, Ecuador, Iran, Guatemala, Holand-Netherlands, Taiwan, Nicaragua, France, Jamaica, Scotland, Yugoslavia, Hong Kong, Trinadad&Tobago, Greece, Cambodia, Ireland.

Statistics

country.birth.self <- gsub("?",NA,df[[34]], fixed = TRUE)
par(mar=c(5.1,13,4.1,2.1))
barplot(prop.table(summary.factor(country.birth.self[label]))[!names(prop.table(summary.factor(country.birth.self[label]))) %in% c("NA's")],las=1, horiz = TRUE)

barplot(prop.table(summary.factor(country.birth.self[!label]))[!names(prop.table(summary.factor(country.birth.self[!label]))) %in% c("NA's")],las=1, horiz = TRUE)

# ratio of missing values (NA)
prop.table(summary.factor(country.birth.self[label]))[names(prop.table(summary.factor(country.birth.self[label]))) %in% c("NA's")]
##       NA's 
## 0.03787756
prop.table(summary.factor(country.birth.self[!label]))[names(prop.table(summary.factor(country.birth.self[!label]))) %in% c("NA's")]
##       NA's 
## 0.03019114

This variable does not seem very important; we may ignore it at first. The only signal we can find is redundant with the Hispanic origin variable.
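The redundancy mentioned above can be checked with a cross-tabulation and an association measure. The following is a minimal sketch on toy data, not the report's own analysis: the vectors `country.birth` and `hispanic.origin` and the helper `cramers.v` are hypothetical names introduced for illustration, and the values are made up.

```r
# Toy data standing in for the real columns (hypothetical values).
country.birth <- factor(c("United-States", "Mexico", "United-States", "Mexico",
                          "Cuba", "United-States", "Cuba", "United-States"))
hispanic.origin <- factor(c("All other", "Mexican", "All other", "Mexican",
                            "Cuban", "All other", "Cuban", "All other"))

# Cross-tabulate the two variables.
tab <- table(country.birth, hispanic.origin)
print(tab)

# Cramer's V: 0 means no association, 1 means one variable determines the other.
cramers.v <- function(tab) {
  # Warnings about small expected counts are silenced for this toy example.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

v <- cramers.v(tab)
print(v)
```

A value close to 1 on the real columns would support dropping one of the two variables; a value close to 0 would mean they carry different information.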

Citizenship (ignored in model)

Metadata

5 distinct values for attribute #34 (citizenship) nominal

citizenship: Native- Born in the United States, Foreign born- Not a citizen of U S , Native- Born in Puerto Rico or U S Outlying, Native- Born abroad of American Parent(s), Foreign born- U S citizen by naturalization.

Statistics

citizenship <- do.call(rbind ,strsplit(as.character(df[[36]]), '-'))
native.citizen <- as.factor(citizenship[,1])
status.citizen <- as.factor(citizenship[,2])
par(mar=c(5.1,18,4.1,2.1))
barplot(prop.table(summary(native.citizen[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(native.citizen[!label])),las=1, horiz = TRUE)

barplot(prop.table(summary(status.citizen[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(status.citizen[!label])),las=1, horiz = TRUE)

This variable does not seem very important; we may ignore it at first.

Own business or self employed (ignored in model)

Metadata

3 distinct values for attribute #35 (own business or self employed) nominal

own business or self employed: 0, 2, 1.

Statistics

own.business.or.self.employed <- as.factor(df[[37]])
par(mar=c(5.1,6,4.1,2.1))
barplot(prop.table(summary(own.business.or.self.employed[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(own.business.or.self.employed[!label])),las=1, horiz = TRUE)

This variable does not seem very important; we may ignore it at first.

Fill inc questionnaire for veteran’s admin

Metadata

3 distinct values for attribute #36 (fill inc questionnaire for veteran’s admin) nominal

fill inc questionnaire for veteran’s admin: Not in universe, Yes, No.

Statistics

fill.inc.questionnaire.for.veterant.admin <- as.factor(df[[38]])
par(mar=c(5.1,8,4.1,2.1))
barplot(prop.table(summary(fill.inc.questionnaire.for.veterant.admin[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(fill.inc.questionnaire.for.veterant.admin[!label])),las=1, horiz = TRUE)

# proportion of each answer, including "Not in universe"
prop.table(summary(fill.inc.questionnaire.for.veterant.admin[label]))
##               No  Not in universe              Yes 
##      0.017363915      0.981343886      0.001292198
prop.table(summary(fill.inc.questionnaire.for.veterant.admin[!label]))
##               No  Not in universe              Yes 
##      0.007363432      0.990632731      0.002003837

This variable can be treated as a boolean: yes / no.
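The boolean reading can be sketched as follows. This is only one possible interpretation, shown on a hypothetical sample of raw values; the names `raw.answers`, `answered` and `answered.yes` do not appear in the report. It assumes that the levels may carry a leading space, as the sex levels do, hence the `trimws` call.

```r
# Hypothetical sample of raw values; in the report the column is df[[38]].
raw.answers <- c(" Not in universe", " Yes", " No", " Not in universe", " Yes")
answers <- trimws(raw.answers)

# Flag 1: the person is in the universe of the question (answered Yes or No).
answered <- answers != "Not in universe"

# Flag 2: TRUE only for an explicit "Yes"; "No" and "Not in universe" map to FALSE.
answered.yes <- answers == "Yes"

print(table(answered, answered.yes))
```

Which of the two flags is more useful depends on whether the signal lies in being asked the question at all or in the answer itself.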

Veterans benefits

Metadata

3 distinct values for attribute #37 (veterans benefits) nominal

veterans benefits: 0, 2, 1.

Statistics

veterants.benefits <- as.factor(df[[39]])
par(mar=c(5.1,4,4.1,2.1))
barplot(prop.table(summary(veterants.benefits[label])),las=1, horiz = TRUE)

barplot(prop.table(summary(veterants.benefits[!label])),las=1, horiz = TRUE)

# proportion of each code
prop.table(summary(veterants.benefits[label]))
##          0          1          2 
## 0.00000000 0.01865611 0.98134389
prop.table(summary(veterants.benefits[!label]))
##           0           1           2 
## 0.253333048 0.009367269 0.737299683

This variable can be treated as a boolean: 0 versus 1 or 2.

Weeks worked in year

Metadata

53 distinct values for attribute #38 (weeks worked in year) continuous

weeks worked in year: continuous.

Statistics

weeks.worked.in.year <- df[[40]]
summary(weeks.worked.in.year[label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   52.00   52.00   48.07   52.00   52.00
summary(weeks.worked.in.year[!label])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   21.53   52.00   52.00
plot(density(weeks.worked.in.year[label]))

plot(density(weeks.worked.in.year[!label]))

boxplot(weeks.worked.in.year[label], weeks.worked.in.year[!label], horizontal = TRUE, names=c("gt50000", "lt50000"))

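We can clearly see a difference between the two distributions: most people with an income greater than 50K worked all 52 weeks (first quartile and median are both 52), whereas the median for the others is 0 weeks.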
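The same comparison can be expressed as two per-group numbers rather than read off the plots. The sketch below uses a small made-up sample; `weeks` and `high.income` are hypothetical stand-ins for `weeks.worked.in.year` and `label`.

```r
# Hypothetical toy sample, not the census data.
weeks <- c(52, 52, 48, 52, 0, 0, 26, 52, 0, 12)
high.income <- c(TRUE, TRUE, TRUE, TRUE,
                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Share of people who worked the full year, per income group.
full.year.share <- tapply(weeks == 52, high.income, mean)

# Median number of weeks worked, per income group.
median.weeks <- tapply(weeks, high.income, median)

print(full.year.share)
print(median.weeks)
```

On the real data, a large gap between the two groups on either number would confirm that this variable is worth keeping in the model.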

Year (ignored in model)

Metadata

2 distinct values for attribute #39 (year) nominal

year: 94, 95.

Statistics

year <- as.factor(df[[41]])
sort(prop.table(summary(year[label])))
##        94        95 
## 0.4715716 0.5284284
sort(prop.table(summary(year[!label])))
##        95        94 
## 0.4977691 0.5022309

This variable does not seem very important; we may ignore it at first.
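One way to back this judgement is to look at the gap in proportions between the two income groups, optionally with a chi-squared test of independence. The sketch below uses made-up counts chosen to resemble the proportions printed above; `counts` and `year.gap` are hypothetical names. Note that on a data set of this size even a small gap can come out as statistically significant, so the size of the gap matters more than the p-value.

```r
# Hypothetical counts per income group and survey year (not the census counts).
counts <- matrix(c(47, 53,
                   50, 50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(income = c("gt50000", "lt50000"),
                                 year = c("94", "95")))

# Row-wise proportions: share of each year within each income group.
props <- prop.table(counts, margin = 1)
print(props)

# Gap between the groups in the share of records from year 95.
year.gap <- props["gt50000", "95"] - props["lt50000", "95"]
print(year.gap)

# Chi-squared test of independence between income group and year.
res <- chisq.test(counts)
print(res$p.value)
```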

Summary

The columns selected for the model are:

Age

Class of worker

Industry code (0, !0)

Education

Wage per hour

Marital status

Major occupation code

Race boolean

Sex as a dummy variable " Male" / " Female"

Full or part time employment stat

Capital gains

Capital losses

Dividends from stocks

Tax filer status

Detailed household and family stat as a dummy variable " householders" or not

Num persons worked for employer

Fill inc questionnaire for veteran’s admin

Veterans benefits

Weeks worked in year
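Several entries in the list above imply a recoding step (0 versus !0, boolean, dummy variable). The following sketch shows how such recodings might look on a toy data frame. Everything in it is an assumption made for illustration: the column names of `raw`, the sample values, the choice of "White" as the reference level for the race boolean, and the exact spelling of the householder level are not taken from the report.

```r
# Hypothetical raw rows; column names and values are illustrative only.
raw <- data.frame(
  industry.code = c(0, 33, 0, 42),
  race = c(" White", " Black", " White", " Asian or Pacific Islander"),
  sex = c(" Male", " Female", " Female", " Male"),
  household.stat = c(" Householder", " Child <18 never marr not in subfamily",
                     " Spouse of householder", " Householder"),
  veterans.benefits = c(2, 0, 2, 1),
  stringsAsFactors = FALSE
)

features <- data.frame(
  # Industry code: 0 versus any other code.
  has.industry = raw$industry.code != 0,
  # Race as a boolean (assumed split: one reference level versus the rest).
  is.white = trimws(raw$race) == "White",
  # Sex as a dummy variable; the raw levels keep their leading space.
  is.male = raw$sex == " Male",
  # Household status as a dummy variable: householder or not.
  is.householder = trimws(raw$household.stat) == "Householder",
  # Veterans benefits: 0 versus 1 or 2.
  has.veterans.code = raw$veterans.benefits != 0
)

print(features)
```

Keeping the recoding in one place like this makes it easier to apply exactly the same transformation to the train and test splits.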